Phylogeny and geometry of languages from normalized Levenshtein distance

Author

  • Maurizio Serva
Abstract

The idea that the distance between pairs of languages can be evaluated from lexical differences seems to have its roots in the work of the French explorer Dumont D’Urville. He collected comparative word lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of relation between languages. The method used by modern lexicostatistics, developed by Morris Swadesh in the 1950s, measures distances from the percentage of shared cognates, which are words with a common historical origin. The weak point of this method is that subjective judgment plays a significant role. Recently, we have proposed a new automated method which is motivated by the analogy with genetics. The new approach avoids any subjectivity, and its results can be easily replicated by other scholars. The distance between two languages is defined by considering a renormalized Levenshtein distance between pairs of words with the same meaning and averaging over the words contained in a list. The renormalization, which takes into account the length of the words, plays a crucial role, and no sensible results can be found without it. In this paper we give a short review of our automated method and we illustrate it by considering the cluster of Malagasy dialects. We show that it sheds new light on their kinship relations and also that it provides considerable new information about how Madagascar was settled.
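The distance described in the abstract can be illustrated with a minimal Python sketch. The abstract says only that the renormalization takes word length into account; dividing the Levenshtein distance by the length of the longer word is an assumption made here for illustration. The function names are hypothetical, and the word lists in the usage note are made-up strings, not data from the paper.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the longer word's length.

    ASSUMPTION: the abstract does not state the exact normalization.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return levenshtein(a, b) / longest


def language_distance(words_a: Sequence[str], words_b: Sequence[str]) -> float:
    """Average normalized distance over two meaning-aligned word lists."""
    if len(words_a) != len(words_b) or not words_a:
        raise ValueError("word lists must be non-empty and aligned by meaning")
    total = sum(normalized_distance(x, y) for x, y in zip(words_a, words_b))
    return total / len(words_a)
```

For example, `language_distance(["abc", "de"], ["abd", "de"])` averages one substitution in a three-letter word (1/3) with an exact match (0). Under the assumed normalization every per-word term lies between 0 and 1, so long words do not dominate the average.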

Similar resources

Automated Word Stability and Language Phylogeny

The idea of measuring distance between languages seems to have its roots in the work of the French explorer Dumont D’Urville (1832). He collected comparative word lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of relationship among languages. The metho...

Automated words stability and languages phylogeny

The idea of measuring distance between languages seems to have its roots in the work of the French explorer Dumont D'Urville (D'Urville 1832). He collected comparative word lists of various languages during his voyages aboard the Astrolabe from 1826 to 1829 and, in his work about the geographical division of the Pacific, he proposed a method to measure the degree of relation among languages. Th...

Automated languages phylogeny from Levenshtein distance

Languages evolve in time according to a process in which reproduction, mutation and extinction are all possible. This is very similar to haploid evolution for asexual organisms or for mtDNA of complex ones. Exploiting this similarity it is possible, in principle, to verify hypotheses concerning their relationship. The key point is the definition of the distance among pairs of languages in analo...

Lexical evolution rates by automated stability measure

Phylogenetic trees can be reconstructed from the matrix which contains the distances between all pairs of languages in a family. Recently, we proposed a new method which uses normalized Levenshtein distances among words with the same meaning and averages over all the items of a given list. Decisions about the number of items in the input lists for language comparison have been debated since the begin...

Levenshtein Distances Fail to Identify Language Relationships Accurately

The Levenshtein distance is a simple distance metric derived from the number of edit operations needed to transform one string into another. This metric has received recent attention as a means of automatically classifying languages into genealogical subgroups. In this article I test the performance of the Levenshtein distance for classifying languages by subsampling three language subsets from...

Journal:
  • CoRR

Volume abs/1104.4426

Publication date: 2011